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COMPRESSED NONNEGATIVE MATRIX 
EACTORIZATION IS EAST AND ACCURATE 

Mariano Tepper, Guillermo Sapiro 


Abstract —Nonnegative matrix factorization (NMF) has an 
established reputation as a useful data analysis technique in 
numerous applications. However, its usage in practical situations 
is undergoing challenges in recent years. The fundamental factor 
to this is the increasingly growing size of the datasets available 
and needed in the information sciences. To address this, in this 
work we propose to use structured random compression, that is, 
random projections that exploit the data structure, for two NMF 
variants: classical and separable. In separable NMF (SNMF) the 
left factors are a subset of the columns of the input matrix. 
We present suitable formulations for each problem, dealing with 
different representative algorithms within each one. We show 
that the resulting compressed techniques are faster than their 
uncompressed variants, vastly reduce memory demands, and 
do not encompass any significant deterioration in performance. 
The proposed structured random projections for SNMF allow 
to deal with arbitrarily shaped large matrices, beyond the 
standard limit of tall-and-skinny matrices, granting access to 
very efficient computations in this general setting. We accompany 
the algorithmic presentation with theoretical foundations and 
numerous and diverse examples, showing the suitability of the 
proposed approaches. 

Index Terms —Nonnegative matrix factorization, separable 
nonnegative matrix factorization, structured random projections, 
big data. 

1. Introduction 

The number and diversity of the fields that make use of data 
analysis is rapidly increasing, from economics and marketing 
to medicine and neuroscience. In all of them, data is being 
collected at an astounding speed: databases are now measured 
in gigabytes and terabytes, including trillions of point-of-sale 
transactions, worldwide social networks, and gigapixel images. 
Organizations need to rapidly turn these terabytes of raw data 
into significant insights for their users to guide their research, 
marketing, investment, and/or management strategies. 

Matrix factorization is a fundamental data analysis tech¬ 
nique. Whereas its usefulness as a theoretical tool is beyond 
doubt now, its usage in practical situations has undergone a few 
challenges in recent years. Among other factors contributing to 
this are new developments in computer hardware architecture 
and new applications in the information sciences. 

Perhaps the key aspect is that the matrices to analyze 
are becoming astonishingly big. Classical algorithms are not 
designed to cope with the amount of information present in 
these large-scale problems. We may even hypothesize that, 
if proper tools for these problems were widely available for 
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commercial computer power, such rich datasets would be 
created at an increasing speed. 

In this big data scenario, data communication is one of 
the main performance bottlenecks for numerical algorithms 
(here, we mean communication in a broad sense, including 
for example, network transfers and secondary memory access). 
Since the data cannot be easily stored in main memory, 
performing fewer passes over the original data, even at the cost 
of more fioating-point operations, may result in substantially 
faster techniques. 

Lastly, the architecture of computing units is evolving 
towards massive parallelism (consider, for example, general 
purpose GPUs and MapReduce models [1]). Numerical algo¬ 
rithms should adapt to these environments and exploit their 
benefits for boosting their performance. 

In recent years. Nonnegative Matrix Factorization 
(NMF) [2] has been frequently used since it provides a 
good way for modeling many real-life applications (e.g., 
recommender systems [3] and audio processing [4]). NMF 
seeks to represent a nonnegative matrix (i.e., a matrix with 
nonnegative entries) as the product of two nonnegative 
matrices. One of the reasons for the method’s popularity is 
that the use of non-subtractive linear combinations renders 
the factorization, in many cases, easily interpretable. The goal 
of this work is to develop algorithms, based on structured 
random projections, for computing NMF for big data matrices. 

A. Two flavors of nonnegative matrix factorization 

Given an m x n nonnegative matrix A, NMF is formally 
defined as 

min ||A-XY||" s.t. X,Y>0, (1) 

where r is a parameter that controls the size of factors X and 
Y and, hence, the factorization’s accuracy. For simplicity, we 
use B > 0 to denote a matrix B with nonnegative entries. 

Despite its appealing advantages, NMF does present some 
theoretical and practical challenges. In the general case, NMF 
is known to be NP-Hard [5] and highly ill-posed [6, and 
references therein]. However, there are matrices that exhibit 
a particular structure such that NMF can be solved efficiently 
(i.e., in polynomial time) [7]. 

Definition 1. A nonnegative matrix A is r-separable if there 
exists an index set 1C of cardinality r over the columns of A 
and a nonnegative matrix Y G such that 

A = (A):kY, 


( 2 ) 


2 


where (A):^: represents the matrix obtained by horizontally 
stacking the columns of A indexed by JC. Consequently, 
a nonnegative matrix A is near r-separable if it can be 
represented as 

A = (A):^Y + N, (3) 

where N is a noise matrix. 

When A presents this type of special structure, the NMF 
problem (now denoted as separable NMF, SNMF) can be 
simply modeled as 

zt^lC = r, 

min IIA — (A).x:Y|| p s.t. (4) 

where the choice of the Frobenius norm corresponds to a 
Gaussian noise matrix N. Having a more constrained structure 
for the left factor (i.e., X = (A).^;) makes the problem 
significantly easier to solve, improving the stability and the 
speed of the involved algorithms. 

B. Structured random projections 

In recent years, we have seen an increase in the popularity of 
randomized algorithms for computing partial matrix decompo¬ 
sitions. These partial decompositions assume that most of the 
action of a matrix occurs in a subspace. The key observation 
here is that such a subspace can be identified through random 
sampling. After projecting the input matrix into this subspace 
(i.e., compressing it), the desired low-rank factorization can 
be obtained by manipulating deterministically this compressed 
matrix. In many cases, this approach outperforms its classi¬ 
cal competitors in terms of accuracy, speed, and robustness. 
See [8] for a thorough review of these techniques. 

C. Contributions and organization 

We propose an algorithmic solution for computing struc¬ 
tured random projections of extremely large matrices (i.e., 
matrices so large that even after compression they do not fit in 
main memory). This is useful as a general tool for computing 
many different matrix decompositions (beyond NMF, which 
is the particular focus of this work). Our approach leads to 
the implementation of compression algorithms that perform 
out-of-core computations (i.e., loading information in main 
memory only as needed). 

We propose to use structured random projections for NMF 
and show that, in practice, their use implies a substantial 
increase in speed. This performance boost does not come at the 
price of significant errors with respect to the uncompressed so¬ 
lutions. We show this for representative algorithms of different 
NMF approaches, namely, multiplicative updates [9], active set 
method for nonnegative least squares [10], and ADMM [11]. 

We present a general SNMF algorithm based on structured 
random projections, reaching to similar conclusions as in the 
general NMF case. While there are in the literature very 
efficient SNMF algorithms for tall-and-skinny matrices [12], 
we show that, when the rank of the desired decomposition is 
lower than the number of columns of the input matrix, the 
proposed algorithm is substantially faster than its competitors. 


Interestingly, the use of structured random projections allows 
to compute SNMF for arbitrarily large matrices, eliminating 
the tall-and-skinny requirement while preserving efficiency. 
Our code is available at http://www.marianotepper.com.ar/ 
research/cnmf. 

The remainder of the paper is organized as follows. In 
Section II we provide an overview of random projection 
methods for matrix factorization and provide some theoretical 
results relevant to this work. In sections III and IV we propose 
a set of techniques for using random projections for NMF and 
SNMF, respectively. Extensive experimental results on diverse 
problems are presented in Section V, studying the performance 
of the proposed techniques on both medium and large-scale 
problems. Finally, we provide some concluding remarks in 
Section VI. 

II. On randomization and matrix decompositions 

In this section we begin by describing the random projec¬ 
tion algorithm used throughout this work. We also present 
theory that provides some guarantees for the use of random 
projections in matrix decomposition (in this work we use in¬ 
terchangeably projection or compression). Finally, we discuss 
the performance limits of the algorithm when dealing with big 
data and introduce a way to overcome such limitations. 

In problems (1) and (4), the rank of the desired matrix 
factorization is prespecified. In the following, we will thus 
assume that we are given a matrix A, a target rank r, and an 
oversampling parameter rov (its role will become clear next). 

We define a Gaussian random matrix Fi as a matrix whose 
entries are drawn independently from a standard Gaussian 
distribution, i.e., each entry is a realization of an 

independent and identically distributed random variable with 
distribution A/’(0,1). 

The overall approach to matrix factorization presented in [8] 
consists of the following three steps: 

1) Compute an approximate basis for the range of the 
input matrix A: we construct a matrix Q, with r -f rov 
orthonormal columns (i.e., Q^Q = I, where I is the 
(r + rov) X (r + rov) identity matrix), for which 

||A-QQ'^A|| « min ||A - Z ||2 = CTr+i, (5) 

^ rank(Z)<r 

where (jj denotes the j-th largest singular value of A. In 
other words, QQ^A is a good rank-r approximation of 

A. 

2) Compute a factorization of Q^A. 

3) Multiply the leftmost factor of the decomposition by Q, 
all other factors remain unchanged. 

Throughout this paper, we will use the algorithm in Fig. 1 for 
performing Step (5). For more details about this algorithm, 
we refer the reader to [8]. Since the algorithm exploits the 
structure in A, trying to find a subspace were the majority of 
its action happens, we will refer to this technique as structured 
random compression. 

In the following, we present some results from [8] that 
demonstrate the nice theoretical characteristics of the com¬ 
pression matrix Q, obtained with the algorithm in Fig. 1. Let 
E denote the expectation with respect to the random matrix. 
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input : a matrix A G a target rank r G N+, an 

oversampling parameter rov G (r + row < m), 
exponent w eN. 

output: a compression matrix Q G ^ 

1 Draw a Gaussian random matrix CIl 

2 Form the matrix product 

B = (AA^)"^ A^^; 

3 Let Q be an orthogonal basis for B, obtained using the QR 
decomposition; 


^mX(r+rov) foj- A. 
^nX (r+rov) . 


Fig. 1. Structured random compression algorithm. 


Theorem 1 ([8]). Given a matrix A G a target rank 

r G N+, and an oversampling parameter rov ^ N+ fr + 
rov ^ 'm), execute the algorithm in Fig. 1 with w = {) (no 
power iterations). We obtain a matrix Q G 
P = QQ^. Then, 

E||A-PA||^<(i+;:^)'^'('^ctA , (6) 

\3>r / 


E 


|A-PA||2< 


+ ^min{m, n} cr^+i. (7) 


Note that {T,j>r ^j) 
errors, see Equation (5). 


and (Tr-\-i are the smallest possible 


Theorem 2 ([8]). Frame the same hypotheses of Theorem 1. 
Assume rov ^ 4. Then, \/u^t > 1, 

||A-PA||p<(l+iGi2GG;)'^' + 

\j>r J 


..y- eVr+rov 

rov +1 


( 8 ) 


with failure probability at most bt ^^^^ + 26 We also have 

||A - PAII 2 < 3Vr + rov ( E! 


vj>r 


( 1 + cTr+i, (9) 

with failure probability at most b{roy)~'^^^. 


Beyond proving that the achieved error is very close to 
the optimal error, the above theorems provide a theoretical 
justification for the oversampling parameter rov It grants more 
freedom in the choice of Q, crucial in the effectiveness of 
Step (2) [8]. This freedom allows the probability of failure to 
decrease exponentially fast as rov grows. 


Theorem 3 ([8]). Given a matrix A G a target rank 

r G N+, an oversampling parameter rov ^ fr + rov ^ tu), 
and an exponent tr G N, execute the algorithm in Fig. 1. We 
obtain a matrix Q G p = QQ^. Then, 

E11A-PA1I2 < CTr+l, ( 10 ) 

where 

C = 1 + fffw) + - k. ( 11 ) 


As we increase the exponent w, the power scheme drives 
the extra factor in the error to one exponentially fast. As noted 
in [8], finding an analogous bound for the Frobenius norm is 
still an open problem. 

Throughout this work we use a Gaussian test matrix Ft. 
Other alternative test matrices can be used in its place, 
such as the subsampled randomized Hadamard and Fourier 
transforms [8, 13]. The product AFl can be significantly faster 
when using a test matrix obtained with these transforms, 
giving an automatic speedup. From this perspective, all the 
experimental results in this paper present a worst case scenario 
with respect to running times. 

Note. An alternative to structured random compression would 
be to just left-multiply A by a Gaussian random matrix Ft. 
Let us define the compression matrix Qo G as 

Qn = s-i/2 n, (12) 

where is a Gaussian random matrix Then, instead of com¬ 
puting measures with the data matrix A on the m-dimensional 
space, the much smaller matrix Qq A can be used to compute 
approximations in the 5-dimensional space. It is well studied 
that Gaussian projection preserves the £2 norm [e.g., 14, 
and references therein]. However, our extensive experiments 
show that structured random compression achieves better 
performance than Gaussian compression. Intuitively, Gaussian 
compression is a general data-agnostic tool, whereas structured 
compression uses information from the matrix (an analogous 
of training). Theoretical research is needed to fully justify this 
performance gap. 

A. Big data algorithmic solutions 

By design, the product in line 2 of the algorithm in Fig. 1 
forms a tall and skinny matrix B G ^ where m ^ 

(r + rov)- We have thus successfully reduced the number of 
columns in B from n to r+rov- While matrix A may not fit in 
main memory, we can still perform the necessary computations 
using B without significant loss of precision. 

An interesting question arises when working with large 
matrices: what happens if the number of rows m is so large 
that even B does not fit in main memory? Assuming that we 
need to store B in secondary memory (i.e., the hard drive), 
how do we compute its QR decomposition (line 3 of the 
algorithm in Fig. 1)7 

A suitable and efficient algorithm to address the latter 
question is the direct TSQR (tall-and-skinny QR) [12]. For 
completeness, we give its outline in Appendix A. The highlight 
of TSQR is that it is designed for being parallelizable while 
minimizing the dependencies between parallel computations 
(i.e., communication costs). Thus, it adheres perfectly to the 
main mantra of this work. 

An interesting byproduct of using TSQR is that there is 
no need to form the entire matrix B in main memory. See 
Appendix A for further details. This allows to implement 
an out-of-core version of the compression algorithm, that is, 
where the involved matrices do not reside in main memory. 

Let us note that the use of TSQR for computing random 
compression is introduced in this paper for the first time. 
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providing a true scalable solution for computing many types 
of matrix decompositions (i.e., beyond NMF) when both the 
number of rows and columns of the input matrix are large. 

B. Matrix decompositions with alternative norms 

The algorithm in Fig. 4 works under the Frobenius and 
nuclear norms, as detailed in the theorems presented above. 
These two cases already cover a significant range of matrix 
decompositions that are commonly used in practice. 

However, other norms are becoming increasingly popular 
in recent years. For example, NMF is widely used in au¬ 
dio processing with the Itakura-Saito distance instead of the 
Frobenius norm in Problem (1). The entry wise norm is 
also very popular when the input matrix A is contaminated 
with impulsive noise. In these cases, proper structured random 
projection algorithms need to be used, adapted to the right type 
of measure for the application at hand. 

In particular, we are currently investigating the use of the 
framework here developed for NMF under an norm. In 
such a case, the fast Cauchy transform appears as a suitable 
alternative for the task [15]. 


input : a matrix A G a target rank r G N+, an 

oversampling parameter rov G 
{r + rov < min{m, n}), an exponent w eN. 
output: nonnegative matrices G G 

1 Compute compression matrices L G 
R G R(^+^ov)xn. 

2 k ^ 1; 

3 Initialize Y^; 

4 repeat 

5 Yfc^YfeR'T; 

6 Find Xfc+i e X^+i > 0, such that 

||A - Xfc+iYfe||^ < ||A - XfcYfcll^ ; 

7 X^-i_i i L X^-|_i, 

8 Find Yfe+i G Yfc+i > 0, such that 

II ^ ^ 11 ^ II ^ ^ iP 

||A — Xfc+iYfc+i||^ < ||A — Xfc+iYfc||^ ; 

// The optimizations in lines 6 and 8 can 
be performed using any variant of 
multiplicative updates or any 
nonnegative least squares method. 

9 k ^ k 1; 

10 until convergence’. 


III. Randomly compressed NMF 

The goal of this section is to efficiently solve Problem (1) 
for large input matrices. We do not aim at developing a new 
NMF algorithm, but rather to illustrate how structured random 
projections can be used to enhance the speed of existing 
algorithms and make them usable for big data. As detailed 
in Section V, this speedup does not come at the price of 
significantly higher reconstruction errors. 

Most NMF algorithms work by iterating the following two 
steps: 

• Find X/c+i G X/^+i > 0, such that 

||A-Xfe+iYfc||^< ||A-XfcYfe||P (13a) 

• Find Y/c+i G Y/^+i > 0, such that 

||A - Xfe+iYfe+ill^ < ||A - Xfc+iYfell^ . (13b) 

This general formulation encompasses different particular al¬ 
gorithms such as multiplicative updates [9] and several variants 
of alternating nonnegative least squares [10, 16, 17]. The latter 
consists of a particular case of Algorithm (13b) in which its 
right-hand sides are minimized to the end. We thus obtain the 
following algorithm: 


Xfe+i = argmin ||A - XY^H^ 

Xgjjmxr 

S.t. X > 0, 

(14a) 

Yfe +1 = argmin ||A - Xfe+iY||p 

s.t. Y > 0. 

(14b) 




Let us assume that we apply the algorithm in Fig. 1 to 
A and A^ and obtain two matrices L G G 

]^(r+rov)xn^ respectively. By construction, L and R have 
orthonormal columns and rows, respectively. Also let A = 

art, a = L^A. 

Using matrices L and R, we propose to approximate 
Algorithm (13b) with the iterations 


Fig. 2. NMF using structured random compression. 


• Find X/c +1 G X/^+i > 0, such that 

||A-Xfc+iYfeRT||^ < ||A-XfeYfcRT||^. (15a) 

• Find Y/c+i G Y/^+i > 0, such that 


A-L^Xfe+iY^+i 


< 


A-L^Xfc+iYfe 


. (15b) 


Equivalently, using L and R, we propose to approximate 
Algorithm (14b) with the iterations 


X/c+i = argmin ||A — XY/^R^ 


|2 

If 


Yfc +1 = argmin 

Ygjjrxn 


A - L'^Xk+iY 


s.t. X > 0, (16a) 
s.t. Y > 0. (16b) 


The algorithm in Fig. 2 contains an overview of the pro¬ 
posed NMF algorithm using structured random compression. 
For our experiments regarding the techniques described in 
Section III, as representative examples of Algorithm (13b) 
and Algorithm (14b), we respectively use the active set 
method [10] and the multiplicative updates in [18, Eq. (8)]. 

We achieve a significant size reduction of the matrices in 
algorithms (15b) and (16b). For each of these algorithms, 
we reduced the number of columns from n to r -f rov in 
equations (15a) and (16a) and the number of rows from m to 
r -h rov in equations (15b) and (16b). This makes the system 
much faster to solve, but more importantly in our context, 
it greatly reduces the cost of data communication in parallel 
frameworks. For example, after compression, large matrices 
might fit in GPU memory. 

Alternatively, Problem (1) can be equivalently re-formulated 
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as 


min ||A-XY 


s.t. 


U = X, V = Y, 
U,V > 0. 


(17) 


Again, using the matrices L and R defined above, we 
propose to approximate Problem (17) with 


min LL^ (A 


U = X, 

XY)R'^R||p s.t. V = Y, (18) 
U,V >0. 


Let A = L^AR^, X = L^X, and Y = YR^. We propose 
to further approximate Problem (17) with 


1) Extract r columns of A, indexed by JC. The literature 
usually refers to them as extreme columns. 

2) Solve 

Y = argmin || A — (A):^: H||^ s.t. H > 0. (20) 

The literature on SNMF has mainly focused on Step (1) of 
the above algorithm. There are several types of algorithms for 
performing this task [21-24]. As for Step (20), Problem (20) 
involves solving n nonnegative least squares problems sepa¬ 
rately, i.e., 

(Y).i = argmin ||A.i - h||p s.t. h > 0. (21) 

heM"" 


mm 

^^^ir+roY)Xr 


A-XY 


U = LX, 

s.t. V = YR, (19) 
U,V>0. 


The alternating direction method of multipliers (ADMM) 
can be used for solving Problem (17) [11]. Thus, a similar 
technique can solve Problem (19). The details of the proposed 
algorithm are presented in Appendix B. 

The level of compression in Problem (19) is significantly 
higher than in algorithms (15b) and (16b). The latter for¬ 
mulations only employ (alternated) single-sided compression, 
whereas the former uses a (simultaneous) double-sided com¬ 
pression. One may be inclined to think that such an aggressive 
compression might lead to greater errors; however, in practice, 
this is not the case. Studying this behavior from a theoretical 
standpoint might shed light into this interesting characteristic. 


A. Limits of NMF for big data 

When matrix A gets sufficiently large, solving Problem (1) 
becomes challenging. The compression techniques here pre¬ 
sented significantly alleviate the problem for in-core compu¬ 
tations and are easily extensible for out-of-core computations. 
For example, each iteration of the multiplicative updates algo¬ 
rithm can be implemented on a MapReduce framework [19]; 
its structured compressed version can be easily adapted in this 
framework, greatly reducing communication costs thanks to 
the use of smaller matrices. Implementing our compressed 
ADMM algorithm on a MapReduce framework is just as 
straightforward. 

However, when dealing with large volumes of data, the 
practical problem actually resides in the iterative nature of 
the algorithms. As an example, consider that the execution 
time of a single iteration of the multiplicative algorithm on 
a MapReduce framework is measured in hours for sparse 
matrices with millions of columns and rows [19, 20]. As 
expected, the issue is hugely exacerbated for dense matrices. 

IV. Randomly compressed separable NMF 

Following Definition 3, let us now assume that matrix 
A is (near) r-separable. Most state-of-the-art techniques for 
computing SNMF, see Problem (4), are based on the following 
two-step approach: 


This makes Step (20) trivially parallelizable. 

Let Q G be an orthonormal basis for A G 


Q^A = 


R 

0 ’ 


Q^(A).k; = 


0 


( 22 ) 


where R G A key observation here is that the zero 

rows do not provide information for finding extreme columns 
of A [12]. We also trivially have that, for any orthonormal 
matrix Q G 


QT (A-XY)||^oc||A-XY||^. 

(23) 

r = argmin ||A - (A):k:H||^ 

H>0 

(24a) 

= argmin (A - (A).k;H)||^ 

H>0 

(24b) 

= argmin ||R - (R);k:H||^ . 

(24c) 


H>0 


Notice that Problem (24c) has succeeded to reduce the problem 
size to n X n from the original m x n Problem (20). We then 
obtain the following three-step algorithm [12]: 

1) Compute Q using, e.g., a QR decomposition of A. 

2) Find r extreme columns of R = Q^A, indexed by 1C. 

3) Solve 

Y = argmin ||R — (R):x: H||^ s.t. H > 0. (25) 

HGR^xn 


As the main assumption in NMF and SNMF is that A has 
(or can be approximated by) a low-rank structure, by all practi¬ 
cal means we expect that r C min(m, n); otherwise, it would 
not even make sense to try these type of decompositions. We 
claim that little to no information is lost by replacing the full 
orthonormal basis with a rank-preserving basis that projects 
the data into a lower-dimensional space. 

As the reader might be already suspecting, we propose to 
obtain such a basis via the use structured random projections. 
This involves a small but conceptually important change in the 
above SNMF algorithm. Replace Step (1) by 
1) Compute a structured random compression matrix Q for 
A. 

The proposed algorithm is depicted in Fig. 3. Let us now detail 
the main differences with the QR-based algorithm. 
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input : a matrix A G a target rank r G N+, an 

oversampling parameter rov G 
(r + rov < min{?Ti, n}), an exponent w eN. 
output: index JC of the extreme columns of A, nonnegative 
matrix Y G 

1 Compute Q G R"'x("’+"’ov) ^^j^g the algorithm in Fig. 1; 

2 Find r extreme columns of R = Q^A, indexed by JC; 

3 Solve Vz G [0, n) 

(Y):i ^ argmin ||R:i - (R):k: h|||. s.t. h > 0; (26) 

hGM^+'^ov 


Fig. 3. SNMF using structured random compression. 

First, let us note that R = Q^A is now an (r + rov) x ^ 
matrix instead of an n x n matrix. This allows to process 
matrices that have many more columns, as storing R has 
become orders of magnitude easier/cheaper. Also note that 
each nonnegative least squares problem in Problem (21) has 
also become orders of magnitude smaller and thus faster to 
solve. Again, the huge decrease in communication costs for 
parallel implementations is even more important in our context 
than the gain in computational speed. 

Second, the computation of the basis itself has become 
much faster. This is easy to understand when we compare the 
algorithm in Fig. 1, which only computes the QR decomposi¬ 
tion of an m X (r + rov) matrix, with the QR decomposition of 
the full m X n matrix. Of course, as the ratio r/n decreases, 
the proposed algorithm becomes faster. 

Let us assume for a moment that n is sufficiently small 
such that we can use the TSQR algorithm directly on the input 
matrix A, but not trivially small. As detailed in Appendix A, 
the QR decomposition in Equation (30) in the appendix is 
the only centralized step in TSQR; the amount of information 
that needs to be transmitted to carry this step is, again, 
orders of magnitude smaller when using structured random 
compressions. 

Note. The separable NMF model is similar to the model pre¬ 
sented in [25] (and in [26] without non-negativity constraints) 

^mm J|A-AT||^ + A||TL.o s.t. T > 0, (27) 

where ||T||j.q^ q denotes the number of non-zero rows. The 
similarity resides in that selecting a subset of rows from T 
is equivalent to selecting a subset of columns from A. This 
problem can be relaxed into a convex problem by replacing the 
£o pseudo-norm by a (possibly weighted) £i norm. However, 
whichever optimization technique we choose for solving this 
problem, it will involve an iterative algorithm, where an n x n 
system is solved in every iteration. In [25], the problem is 
shrank by clustering the columns of A and feeding a new 
matrix, only containing the cluster centers, into Equation (27). 
Eor this reasons, in our view, the SNME model, as presented 
here, presents a cleaner and faster alternative to Equation (27). 

V. Experimental results 

We will now present numerous examples supporting the use 
of structured random projections for NME and SNME, both in 
terms of speed and accuracy. 



Fig. 4. Performance of out-of-core compression. We tested different values 
for m, while fixing n = 500 . The out-of-core algorithm for structured random 
compression presented in Section II is slower for matrices with approximately 
less than 2 • 10 "^ rows. For larger matrices, it exhibits the same complexity as 
the in-core one (linear in m). Out-of-core computations do not come at the 
price of a significantly slower compression algorithm, though permit to work 
with significantly larger matrices. 

Before jumping to these problems, in Eig. 4 we show a 
simulation of the nice properties of the out-of-core com¬ 
pression algorithm presented in Section II. We performed 
our tests on m x n matrices with Gaussian entries, where 
different values for m were tested, ranging from 10^ to 10^, 
and n = 500 in all cases. This ensures that all matrices fit 
in main memory, allowing (1) to compress them with the 
in-core algorithm, and (2) to disregard disk access times, 
making the comparisons fair. The out-of-core algorithm for 
structured random compression is slower for matrices with 
approximately less than 2 • 10^ rows; for these small matrices, 
the overhead of processing the matrix per blocks becomes 
evident (notice though that both computing times are well 
under 1 second). Eor larger matrices, the overhead’s impact 
becomes less significant, and both algorithms exhibit the same 
overall performance (linear in m). In summary, we observe the 
expected behavior: the greater flexibility of the proposed out- 
of-core compression algorithm for processing large matrices 
does not cause performance to degrade with respect to the 
in-core one. 

A. NMF 

Eor our experiments regarding the techniques presented in 
Section III, as representative examples of Algorithm (13b) 
and Algorithm (14b), we respectively use the active set 
method [10] and the multiplicative updates in [18, Eq. (8)]. Eor 
these two algorithms, we compared with a vanilla version and 
a variant using Gaussian projection, as presented in [27] (also 
see Section II). We also implemented the ADMM algorithm 
in [11] and the proposed ADMM algorithm with structured 
random compression. All the methods were implemented in 
Matlab. In all tests, we set ic = 4 and rov = 10 in the 
compression algorithm in Eig. 1; we further adjust the value 
of rov so that r + rov = min(max(20, r + rov), 
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TABLE L Performance when compressing a PET image. See Fig. 8 for 
a detailed explanation of the setup. As we can observe, the use of Gaussian 
compression (GC) is detrimental to the reconstruction error (higher values 
indicate a lower error), while the proposed structured random compression 
(SC) has no significant impact on it. As a counterpart, the use of SC 
significantly decreases the computing time with respect to the original method. 
The best values for each column are highlighted in green. 



Error (- log^o) 

Mean STD Median 

Time (s) 

Multiplicative 

3.628 

3.435 

3.980 

177.813 

Multiplicative - GC 

3.496 

3.304 

3.834 

19.742 

Multiplicative - SC 

3.626 

3.433 

3.979 

71.437 

ADMM 

3.638 

3.436 

4.027 

32.195 

ADMM - SC 

3.636 

3.433 

4.028 

23.168 

Active set 

3.628 

3.436 

3.976 

18.251 

Active set - GC 

3.536 

3.350 

3.882 

14.277 

Active set - SC 

3.638 

3.436 

4.024 

11.371 

ALS* 

3.638 

3.436 

4.023 

18.162 

ALS with proj. grad.* 

3.634 

3.436 

4.004 

58.208 


^ Obtained from http://cogsys.imm.dtu.dk/toolbox/nmf/. 


We begin by showing in Fig. 7 simulations results of 
the different NMF variants on synthetic examples. The first 
interesting observation from these examples is that, although 
the computation of the compression matrix is more costly for 
structured than for Gaussian compression, this might not end 
up reflected in the overall computing time; this is because, in 
general, the NMF variant with Gaussian compression requires 
more iterations to converge. The second observation is that 
the NMF variants that use structured compression yield very 
similar relative reconstruction errors than their uncompressed 
counterparts (higher in one example, lower in three). For 
multiplicative updates and ADMM, the gain in speed of using 
structured compression is huge; for active set, the speedup is 
not as dramatic. Lastly, Gaussian compression seems to come 
at the cost of higher reconstruction errors. 

We also run different NMF algorithms on a hyperspectral 
positron emission tomography (PET) image, see Fig. 8. This 
example allows to visually compare the errors produced by the 
different methods. The NMF methods with Gaussian compres¬ 
sion create “clusters” of errors (particular areas in which the 
errors seem to concentrate). In Table I we show several error 
statistics and the computing time for the different methods. 
The statistics also refiect the same behavior as our visual 
previous inspection. Structured compression has a positive 
effect on the computing time (it decreases), and no significant 
effect on the error statistics. 

Climate datasets are very interesting to analyze using NMF. 
We believe that the evidence of a low rank model within 
climate data is of interest by itself. Nonnegativiy is a useful 
addition since, under this model, the effects of different factors 
cannot cancel each other. The technical details and results 
of an experiment using climate data are shown in Fig. 9. 
In this case, we only use the active set method for our 
comparisons. We found that two factors explain the data with 
enough accuracy. Both factors seem to correspond to two very 
different seasons across the globe, and they exhibit inversely 
correlated periodic patterns. While the left and right factors 
obtained using structured compression are very similar to their 
uncompressed counterparts, Gaussian compression introduces 


visible artifacts in the resulting factorization. Structured ran¬ 
dom compression also is the fastest of the three methods. 

Our last classical NMF example consists of a popular appli¬ 
cation: biclustering. In this case, we bicluster a bipartite social 
network, i.e., that contains two different types of nodes. In our 
particular example, these two types correspond to characters 
from Marvel comic books and to the comic books in which 
they appear. We performed NMF with r = 10 (recall that r is 
the number of factors). We then thresholded each column of 
X and each row of Y to obtain sparse components that we 
define as a bicluster (we could have also added a sparsity term 
to the formulation, but opted for a simpler approach that does 
not introduce additional complexity). For each column (row) 
of X (Y), we set to zero the entries smaller than the column 
(row) mean plus three standard deviations. Then, for display 
purposes, we only keep the largest 25 entries in each column 
of X if there are more than that number of nonzero entries. In 
Fig. 10 we show two of the biclusters obtained in such a way. 
It becomes quickly apparent that structured compression does 
not introduce significant artifacts in the biclusters, whereas 
the clusters found with Gaussian compression are heavily 
intertwined (all ten factors seem to be mixed together). For 
example, Mary Jane Parker-Watson, Spider-Man’s wife, is not 
a recurring character of the Fantastic Four comic books. 

To summarize, the overall observation is that structured 
compression brings additional speed to NMF methods without 
introducing significant errors. On the other hand, Gaussian 
compression seems to come at the cost of higher reconstruction 
errors and is not consistently faster than structured compres¬ 
sion. 

B. Separable NMF 

We implemented our SNMF algorithms in Python, using 
the dask and into libraries^ to perform out-of-core matrix 
computations (i.e., without fully loading the involved matrices 
in main memory). A byproduct of this implementation choice 
is that we can compute SNMF on very large matrices on a 
regular laptop, without having to resort to a cluster. To the 
best of our knowledge, our TSQR implementation is the first 
publicly available one that runs on any regular laptop using 
out-of-core computations. 

We perform all of our comparisons with the SNMF al¬ 
gorithm using the QR decomposition [12], analyzed in Sec¬ 
tion IV. We use SPA [21, 24], and XRAY [23] as the column 
selection algorithms. Throughout this section, we simply use 
compression to refer to structured compression. In all tests, 
we set ic = 0 and rov = 10 in the compression algorithm in 
Fig. 1; we further adjust the value of rov so that r + rov = 
min(max(20, r + rov), ^). 

We first present results on synthetic matrices in Fig. 11. 
We produced different matrices of fixed size by varying their 
rank, see Fig. 11(a). In general, we aim at explaining the data 
matrix with a small fraction of its columns. The proposed 
compression method for SNMF is faster when fewer factors 
are needed to explain the data. On the other hand, QR- 
based methods have always the same (high) computing time, 

^ http://dask.readthedocs.org/, http://into.readthedocs.org/ 






O Multiplicative 
...©■■■Multiplicative - GC 
-O- Multiplicative - SC 




-e— Multiplicative 
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10 ^ 


10 '^ 


10000 15000 

Matrix size 


20000 



(a) Synthetic dense matrices. The matrix size indicates the number of rows m; the number of columns n is fixed to n = 0.75m in all cases. Since the 
matrices are dense, 5 = 1. 






(a) Synthetic sparse matrices. The matrix size indicates the number of rows m; the number of columns n and the sparsity level 5 are fixed to n = 0.75m 
and 5 = 10“^ in all cases. 


Fig. 7. Performance comparison on synthetic matrices. We first generate two matrices Xgt G 


, Ygt £ where their entries are uniformly 


distributed in [0,1] with probability 5, or zero with probability 1 — 5. We then build A = Xgt Ygt + N, where the entries of N are normally distributed with 
probability 5^, or zero with probability 1 — 5^. GC and SC stand for Gaussian and structured compression, respectively. The reconstruction error is reported 
as the mean over 10 different runs. While both GC and SC are generally faster than the original uncompressed methods (top row), the accuracy levels of the 
latter are only matched (and sometimes even outmatched) by SC (bottom row). 
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Fig. 8. Reconstruction errors when compressing a positron emission tomography (PET) image (http://cogsys.imm.dtu.dk/toolbox/nmf/). The image is 
composed of 40 temporal frames, where each frame is a 128 x 128 x 35 3D image (35 is the number of 128 x 128 slices). The matrix size is then 573440 x 40 
and we perform NMF with r = 5. As we can observe in each slice (a few of them are highlighted with zoom-ins), the use of Gaussian compression (GC) 
increases the reconstruction errors, while structured random compression (SC) has no identifiable effect. See Table I for additional numerical results. 


no matter how simple is the structure of the data. We also 
investigated how much faster is the proposed method with 
respect to QR-based approaches. We generated m x n input 
matrices, where m is fixed and n varies; we then extract 
n/10 columns. Remember that QR-based approaches solve an 
nxn version of Problem (24c), while the proposed compressed 
approach solves an (r + rov) x n version. This difference is 
refiected almost exactly in the speedup that we observe in 
Fig. 11(b): about an order of magnitude is gained with the 
proposed scheme. 

In Fig. 12 we analyze the same dataset as in Fig. 9. 
Interestingly, a similar conclusion is reached using SNMF and 
NMF. The data is well explained by the same two factors 
(in this case, two extreme columns). Notice that the analyzed 
matrix is fat and the QR-based approach provides no speedup, 
i.e., R G in Problem (24c). On the other hand, the 

proposed approach produces a smaller problem independently 
of the input matrix’s shape. Quantitatively, in this example, 
compressed SNMF is two orders of magnitude faster than the 
QR-based SNMF. 

Our last example consists on an application for selecting 
representative frames from videos. We first examine a short 
clip (5 seconds long, 120 frames) of the open-source movie 
“Elephants Dream” at a resolution of 360p (640 x 360). In 
Table II we show a summary of the comparisons performed 
with this video. An example of the frames extracted by SPA 
with compression is shown in Fig. 13. 

Our first observation is that the proposed compressed SNMF 


is at least an order of magnitude faster than the QR-based 
variant. Second, since the matrix built from video is not 
truly low-rank, projecting the matrix into a low-rank subspace 
by means of compression seems to yield better results than 
when using the QR decomposition. Intuitively, compression 
eliminates some variability in the data in such a way that it 
can be better approximated by SNMF. 

Although not strictly comparable, because it does not 
impose nonnegativity constraints, we included in our com¬ 
parisons the method for extracting representative elements 
from [26]. As discussed in Section IV, this method’s formu¬ 
lation does not scale gracefully with large input matrices. A 
fact that is easily refiected in the slow running time, even for 
a relatively small example. 

Scaling to Big Data: We also run tests on the complete 
open-source movie “Elephants Dream.”^ The movie is approx¬ 
imately 11 minutes long (15691 frames). We processed the 
video at two resolutions, 360p (640 x 360), and lOSOp (1920 x 
1080), resulting in 691200 x 15691 and 6220800 x 15691 
matrices, respectively. The HDF5 files occupy 43.55 GB and 
391.13 GB, respectively, not fitting in main memory. Using 
compressed SPA, we extract 130 representatives (extreme 
columns) from the video, one every 120 frames (5 seconds). 
At 360p we obtained a relative error of 0.2941 in 1891 seconds 
(about 32 minutes). At 1080p, we obtained a relative error of 
0.2676 in 20776 seconds (about 5:46 hours) processing both 

^ http: //w w w. elephantsdream. org 
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Active set - GC Active set Active set - SC 



(a) Left factors analysis. Interestingly, the first factor (top row) corresponds to summer and winter in the north and south hemispheres, respectively, while 
the second factor (bottom row) corresponds to summer in the south and winter in the north. Visually, it is very clear that SC introduces much less artifacts 
than GC compared to the vanilla method (center). 

I - Component 1 I 

I Component 2 | 



(b) Right factors analysis. In the top row, we can easily observe that the two factors are periodic and inversely correlated, corroborating the winter/summer 
duality between both components. In the bottom row, we observe that the GC factors are much more noisy (about an order of magnitude larger), compared 
to the original and SC methods. 


Fig. 9. NMF on gridded climate data (http://www.esrl.noaa.gov/psd/repository/). The data contains daily mean surface temperatures arranged in a 144 x 73 
grid since 1948 (23742 days in total), forming a 10512 x 23742 matrix. We perform NMF using the active set method with r = 2. The computing times 
for the method in its vanilla version (center), with Gaussian compression (GC) and, with structured random compression (GC) were 50, 70, and 20 seconds, 
respectively; the respective relative reconstruction errors were 0.0459, 0.0537, and 0.0458, confirming the conclusions reached through visual inspection. 


matrices on a laptop with 16GB of memory. 

VI. Conclusions 

In this work we proposed to use structured random projec¬ 
tions for NMF and SNMF. For NMF, we presented formula¬ 
tions for three popular techniques, namely, multiplicative up¬ 
dates [9], active set method for nonnegative least squares [10], 
and ADMM [11]. For SNMF, we presented a general technique 
that can be used with any algorithm. In all cases, we showed 
that the resulting compressed techniques are faster than their 
uncompressed variants and, at the same time, do not introduce 
significant errors in the final result. 

There are in the literature very efficient SNMF algorithms 
for tall-and-skinny matrices. Interestingly, the use of structured 
random projections allows to compute SNMF for arbitrarily 
large matrices, granting access to very efficient computations 
in the general setting. 

As a byproduct, we also propose an algorithmic solution for 
computing structured random projections of extremely large 


matrices (i.e., matrices so large that even after compression 
they do not fit in main memory). This is useful as a general tool 
for computing many different matrix decompositions, such as 
the singular value decomposition, for example. 

We are currently investigating the problem of replacing 
the Frobenius norm with an Ip norm in our compressed 
variants of NMF and SNMF. In this setting, the fast Cauchy 
transform [15] is a suitable alternative to structured random 
projections. Compression consists of sampling and rescaling 
rows of A, thus identifying the so-called coreset of the 
problem. This formulation is of particular interest for network 
analysis, where we need to deal with sparse structures. 
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Spider-Man 
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Fig. 10. Biclustering the Marvel Universe collaboration network (http://www.chronologyproject.com/). The network links Marvel characters and the 
Marvel comic books in which they appear, and exhibits most characteristics of real-life collaboration networks [28]. It can be represented as an m x n 
matrix, where m = 6445 and n = 12850 are the number of characters and comics, respectively. We bicluster this matrix using NMF with r = 10, 
aiming at obtaining 10 very representative groups of characters appearing jointly in different comic books. The zth bicluster {i = 1... 10) is formed by 
the 2 th column of X and the ith row of Y (small entries were set to zero, as explained in Section V-A). The radar plots represent the coefficients of 
these vectors. We show two biclusters that we identify with characters from the Fantastic Four (first two columns of the figure) and the Spider-Man (last 
two columns of the figure) comics. Active set NMF correctly identifies that Mr Fantastic, The Thing, the Invisible Woman, and the Human Torch are the 
four most recurring characters in the “Fantastic Four” (FF) series. Similarly, active set NMF correctly identifies that Spider-Man/Peter Parker, Mary Jane 
Watson-Parker (Peter Parker’s wife), and Jonah Jameson (Peter Parker’s boss) are the most recurring characters in the “Amazing Spider-Man” (ASM) and 
“Peter Parker, The Spectacular Spider-Man” (PPTSS) series. It is clear that the biclusters recovered using structured random compression (SC) are very close 
to the biclusters found with no compression; contrarily, Gaussian compression (GC) significantly affects the biclustering result. All 10 biclusters can be found 
at http://www.marianotepper.com.ar/research/cnmf. 


Appendix 


written in matrix form as 


A. QR decompositions for tall-and-skinny matrices 

The direct TSQR algorithm uses a simple but highly effi¬ 
cient approach for computing that QR decomposition of a tall 
and skinny matrix. Let A be the m x n matrix to decompose 
(m ^ n). The direct TSQR algorithm starts by splitting A 
into a stack of b blocks 


A/c 


1 : 


A = 




( 28 ) 


where ICi: denotes the set of rows selected in the ith block. 
Each block is factorized into its components , Rx:i: 
using any standard QR decomposition algorithm. This can be 


-1 

> 

• ft 

_1 


1 

£) 

_1 


R-Ki; 






mxn 


mxbn 


bnxn 


The second step is to gather the matrix composed by vertically 
stacking the factors HjCi-. and computing an additional QR 
decomposition, i.e., 


R/Cl: 


1 

jO 

__1 

.RK(,:. 




R . 

nxn 


bnxn 


( 30 ) 


bnxn 
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(a) We extract r columns, where r is the rank of the 10® x 100 input matrix. As 
expected, the computing time of the QR-based methods does not change with 
the number of extracted columns. On the other hand, compressed methods 
are faster when the rank of the input matrix is low compared to its size. 
In this case, out-of-core methods appear slower than in-core ones (slightly 
above 2x). We use an oversampling factor rov = 10 for compression, which 
explains the flattening of the compressed curves towards their end. 



(b) We extract n/10 columns from a 10® x n input matrix. Note that 
the speedup of compressed versus QR-based methods (approx. 10 x) is 
straightforwardly explained by the fixed ratio between the rank and the number 
of columns. In this case, no significant speed difference is noticeable when 
comparing in-core with out-of-core methods. 


Fig. 11. Performance of different SNMF algorithms on synthetic matrices. We generate the input matrix A = XgtYgt, where Xgt £ and 

Ygt G normally distributed entries (r and n take different values in subfigures (a) and (b)). All algorithms select the same set of columns, thus 

producing equal errors. 




(a) Columns extracted with SPA-comp. 
When r = 2, the extreme columns 
look similar to the ones found with 
traditional NMF, see Fig. 9. 



(b) With the uncompressed methods, extracting columns becomes extremely slow (about two orders of magnitude 
slower) than with the compressed methods. Since compressed SNMF is faster with 10 columns than QR-based SNMF 
with two columns, we simply stopped the computation of the latter after 2 columns. Notice that these QR-based 
methods are explicitly designed to be faster for tall-and-skinny matrices, but end-up being extremely slow for fat 
matrices. The proposed compressed SNMF is also very fast for fat matrices. 


Fig. 12. SNMF on gridded climate data. Same dataset as in Fig. 9. The data form a fat 10512 x 23742 matrix. We study the performance of SNMF in 
terms of computing speed and relative error as the number r of columns changes. As with NMF, the data is well explained with only two factors by observing 
the decay in the reconstruction error. SPA with compression seems not to increase its computing time as the number of extracted columns increases; this is 
due to forcing the compression algorithm to produce at least 20 rows, the subsequent column extraction in SPA is extremely efficient. Notice that SPA with 
compression is about four times faster than NMF using the active set method with compression, see Fig. 9. 


This is the only centralized step in TSQR. We then multiply 
the intermediate Q factors to get the matrix 


Q = 


1 

£) 



1 

£) 

_1 




1 

<y 


1 

1_ 




■ (31) 


mxbn 


bnxn 


Finally note that A = QR, where Q is an orthonormal matrix 
(obtained from the multiplication of two orthonormal matrices) 
and R is by algorithmic design, upper triangular. Thus, these 
matrices form a QR decomposition of A. 

1) TSQR for structured random compression: When using 
TSQR for compressing a matrix A, Fig. 1, the input matrix 
to decompose is 


where w eN. Let us assume, for simplicity, that w = 0. The 
input of TSQR is not the matrix B as a whole, but blocks 
extracted from it. We can thus avoid storing the entire matrix 
B in main memory, and compute its blocks as needed, i.e., 

(33) 

A similar (but more complex) indexing holds for re > 0. 


B. An ADMM algorithm for solving Problem (19) 

We consider the augmented Lagrangian of Problem (19), 


if (x,Y,U,V,A,$) = 


A-XY 


+ 


( 


A • LX - U 


# • YR-V 


) 


LX-U 


YR-V 


B = (AA^)“ 


(32) 


F 


(34) 
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Fig. 13. Extracting representative frames from a video (http://www.elephantsdream.org). The video resolution is 640 x 360 pixels and contains 120 
frames (5 seconds). On the top block, we display 40 uniformly sampled frames. We build a 691,200 x 120 matrix by vectorizing one frame per column 
(each frame has 3 color channels), and then use SNMF with compression to extract six representative frames (bottom left). On the bottom right we show 
the (normalized) columns of the matrix H in Step (25), i.e., the reconstruction coefficients. It took 2.18 seconds to compute the result with relative errors of 
0.2714 and of 0.4240 with respect to the compressed and the original matrices, respectively. 


TABLE II. Extracting representative frames from a video For details 
about the experiment setup, see Fig. 13. We are considering a (relatively 
small) 691,200 x 120 matrix to be able to compare the performance of in- 
core and out-of-core methods and with ESV [26], which is not fit for large 
scale matrices. The proposed compression scheme for SNMF (SPA-COMP) 
greatly improves speed with no detriment for the reconstruction error. Notice 
that since the matrix is not actually low-rank (it is a video), enforcing the 
projection onto a subspace helps in finding a better solution (SPA-COMP 
versus SPA-QR). 


Methods 

Comp, model 

r 

Time (s) 

Rel. error 

SPA-COMP 

in-core 

6 

2.28 

0.4240 

SPA-COMP 

out-of-core 

6 

4.75 

0.4293 

SPA-QR 

in-core 

6 

18.76 

0.5446 

SPA-COMP 

in-core 

9 

2.31 

0.3626 

SPA-COMP 

out-of-core 

9 

4.59 

0.3610 

SPA-QR 

in-core 

9 

19.08 

0.4453 

ESV [26] (a = 2)1 

in-core 

9 

57.38 

0.375L 

SPA-COMP 

in-core 

15 

2.65 

0.3068 

SPA-COMP 

out-of-core 

15 

5.50 

0.3047 

SPA-QR 

in-core 

15 

19.93 

0.4011 

ESV [26] (a = 50)1 

in-core 

15 

68.05 

0.1358^ 


^ a is a regularization parameter that (indirectly) controls the number 
of representatives r. 

^ The errors are not directly comparable since this formulation does not 
impose nonnegativity. 


where A G ^ G are Lagrange multipliers, A, 0 G 

IR+ are penalty parameters, and B • C = j 
matrices B, C of the same size. 

We use the Alternating Direction Method of Multipliers 
(ADMM) for solving Problem (19). The algorithm works in a 
coordinate descent fashion, successively minimizing ^ with 
respect to X,Y,U, V, one at a time while fixing the others 


at their most recent values, i.e., 

X/e+i = argmin^ (x, Y/., U/c, A/., ^/e) , (35a) 

Y/c+i = argmin^ [X/,+i, Y, U/., V/., A/,, , (35b) 

U/c+i = argmin^ (X/^+i, Y/^+i, U, V/e, A/^ ) , (35c) 

u>o ^ ^ 

V/e+i = argmin^ (X/,+i, Y/,+i, U/e+i, V, A/., ) , 

v>o ^ ^ 

(35d) 

and then updating the multipliers A, Each of these steps 
can be written in closed form and define our algorithm, see 
Fig. 14. In practice, we set (a,/3,7,^ to 1. 

We now provide a preliminary convergence property of the 
proposed ADMM algorithm. Our analysis follows closely the 
one in [11, Section 2.3]. 

To simplify notation, we consolidate all the variables as 

z= (x,y,u,v,a,#). 

A point Z is a Karush-Kuhn-Tucker (KKT) condition of 


Problem (19) if 

(XY - a) Y"^ + A = 0, (36a) 

X"^ (XY - a) + $ = 0, (36b) 

LX - U = 0, (36c) 

YR - V = 0, (36d) 

A<0<U, AoU = 0, (36e) 

#<0<V, #oV = 0, (36f) 
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input : a matrix A G a target rank r G N+, an 

oversampling parameter rov G 
(r + rov < min{m, n}), an exponent w eN. 
output: nonnegative matrices 11^ G G 

1 Compute compression matrices L G 
R G R(^+^ov)xn. 


2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 

13 

14 


k^l; 

Initialize Ufc, V^; 

A ^ L^AR^; Y ^ V^R^ 

A/c ^ 0; ^ 0; 

I ^ the r X r identity matrix 

repeat 


^fc+i 

Yfe+i 


(AY^ + AL^Ufe - L^Ak){YkY^ + AI)-^ 
(XGiXfc+i+0I)-i(X^+iA+0VfcRT-.^fcR'i 


// (.^+(B))y = max{(B)y, 0 } 

Ufc+i ^ ^+(LXfc+i +A-iAfe); 
Vfc+i ^ ^+(Yfc+iR + 0-i$fc); 

Afc+1 ^ Afc + CA(LXfe+i -Ufc+i); 

^fc+1 ^ + ^0(Yfc+iR — Vfc+i); 

/c ^ /c + 1; 


15 until convergence’. 


point 

-^OO ~ 5 Yqo , Uqo 5 '^OO 5 -^oo 7 • 

We are then left to prove that equations (36e) and (36f) hold. 
Algorithm (35d) guarantees the non-negativity of Uoo, Voo- 
Let us focus on Equation (36e) first. Equation (39e), when 
combined with Equation (39c), yields 

Uoo = (Uoo + A-^Aoo) , (40) 

If (Uoo)y = 0, we get {^+ (A“iAoo))jj = 0 and 
then (Aoo)j^- < 0. If (Uoo)^^ > 0, we get (Uoo)*^ = 
(^(Uoo)ij) and (Aoo)y = 0. From this, we obtain 
that Equation (36e) holds. An identical argument applies for 
equations (36f) and (39f). 

With this, we have proven that any accumulation point 
of is a KKT point of Problem (17). Erom the 

equivalence of problems (1) and (17), any accumulation point 
of {(X/e, Y/e)}^i is a KKT point of Problem (1). ■ 


Fig. 14. ADMM algorithm for NMF with structured random compression. Corollary 1. Whenever converges, it converges to a 

KKT point of Problem (17). 


where o denotes the Hadamard (entrywise) matrix product. 

Proposition 1. Let {Zk}^^i be a sequence generated by the 
algorithm in Fig. 14 that satisfies the condition 


Ideally, we would like to guarantee that Algorithm (35d) 
will always converge to a KKT point of Problem (19). The 
above simple result is an initial step in this direction, providing 
some assurance on the behavior of Algorithm (35d). 


lim {Zk+i — Zk) = 0. 

k^oo 

(37) 

Then any accumulation point of is a KKT point of 

Problem (19). 


Proof: Prom Assumption (37), we have 


X/e+l ~ X/c ^ 0, 

(38a) 

Y/c+l — Y/c ^ 0, 

(38b) 

Y/c+l — A/e ^ 0, 

(38c) 

^/c+1 — ^/c ^ 0, 

(38d) 

U/c+l — U/e ^ 0, 

(38e) 

Y/e+i — V/c ^0. 

(38f) 

Plugging these subtractions in the variable updates 
we get 

in Fig. 14, 

(a - XfcYfe) Yj - LAfe ^ 0, 

(39a) 

^A — Xj;_|_iYfe^ — — )■ 0, 

(39b) 

LX/e+1 — U/e+i ^ 0, 

(39c) 

Y/e+iR — V/e+i — > 0, 

(39d) 

^LX/e+1 + A ^A/e^ — U/e ^ 0, 

(39e) 

(Y/,+iR + - V/, ^ 0. 

(39f) 


Notice that the terms A ^L^U/c — and 
have been eliminated from equations (39a) and (39b) by invok¬ 
ing equations (39c) and (39d), respectively. Equations (36a- 
36d) are clearly satisfied by equations (39a-39d) at any limit 
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